type-stable inner loop for sqrtm #20214
Conversation
As suggested by Ralph_Smith on [discourse](https://discourse.julialang.org/t/review-schur-pade-matrix-powers-speedup/1650/6). On my machine: ~15x speedup.
Do we have a benchmark for this in BaseBenchmarks?
I can't find a benchmark. What would that look like?
Take a look at https://github.com/JuliaCI/BaseBenchmarks.jl#contributing and https://github.com/JuliaCI/BaseBenchmarks.jl/blob/master/src/linalg/LinAlgBenchmarks.jl. The package will handle the timing; you should just provide the function.
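For illustration only, a minimal sketch of what such a benchmark entry could look like, using the BenchmarkTools API that BaseBenchmarks builds on; the group name, sizes, and key strings here are made up, and the real layout lives in LinAlgBenchmarks.jl:

```julia
using BenchmarkTools

# Hypothetical group layout -- BaseBenchmarks handles tuning and timing;
# the contributor only supplies the expressions to benchmark.
const SUITE = BenchmarkGroup()
g = addgroup!(SUITE, "sqrtm")
for n in (10, 100)
    A = UpperTriangular(rand(n, n) + n*I)   # well-conditioned upper-triangular input
    g["UpperTriangular, n=$n"] = @benchmarkable sqrtm($A)
end
```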
```
@@ -1888,10 +1898,7 @@ function sqrtm{T}(A::UpperTriangular{T})
for j = 1:n
```
It seems like we should just have `sqrtm(A)` call a `sqrtm{realmatrix}(A, ::Val{realmatrix})` function with either `Val{true}` or `Val{false}`, so that every operation on `R` is type-stable, and then you won't need `floop`.
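For illustration, a minimal sketch of the suggested `Val` dispatch pattern, in the Julia 0.5/0.6 syntax this thread uses; the `my_sqrtm` name and the real-result check are illustrative, not the PR's actual code:

```julia
# One dynamic dispatch at the boundary, then a fully type-stable kernel.
function my_sqrtm(A::UpperTriangular)
    # Illustrative check: the result stays real only for real input with a
    # nonnegative diagonal.
    realmatrix = eltype(A) <: Real && all(x -> x >= 0, diag(A))
    return my_sqrtm(A, Val{realmatrix})
end

function my_sqrtm{T,realmatrix}(A::UpperTriangular{T}, ::Type{Val{realmatrix}})
    # realmatrix is now a compile-time constant, so TT (and hence eltype(R))
    # is known to the compiler throughout the recurrence.
    TT = realmatrix ? typeof(sqrt(zero(T))) : typeof(sqrt(complex(zero(T))))
    n = size(A, 1)
    R = zeros(TT, n, n)
    # ... the back-substitution recurrence filling R goes here ...
    return UpperTriangular(R)
end
```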
You can put `@inbounds` in front of this loop.
Done both 👍
You can just put the `@inbounds` in front of the `for ... end` block; no need for the extra `begin ... end` here. That is:

```julia
@inbounds for j = 1:n
    # loop body
end
```
```
@@ -1900,14 +1907,10 @@ end
function sqrtm{T}(A::UnitUpperTriangular{T})
```
It seems like we could replace this with `sqrtm{T}(A::UnitUpperTriangular{T}) = sqrtm(A, Val{true})` in terms of the abovementioned method. Given a `Val` method that is type-stable, the only remaining reason to duplicate the code here seems to be to save the `sqrt` call for the diagonal elements, but I doubt this is worth the trouble since there are only O(n) such calls.
Will look at this tomorrow
Since `UnitUpperTriangular{T}` is not a subtype of `UpperTriangular{T}` (??), I think it would be

```julia
sqrtm(A::UnitUpperTriangular) = UnitUpperTriangular(sqrtm(UpperTriangular(A), Val{true}))
```

For some reason this is faster than the present method, which confuses me. This version involves two additional conversions and some unnecessary arithmetic operations in the loop (of order O(n), as you mention)! Are some ops on `UnitUpperTriangular` slower than the same op on `UpperTriangular`?

Will make this change later unless someone can enlighten me.
(Julia doesn't support inheritance of concrete types.)
Indexing expressions `A[i,j]` are slower for `UnitUpperTriangular` than for `UpperTriangular`, and both are slower than for `Array`, because it has to check whether `i <= j` and (in the unit case) whether `i == j`.

Since by construction you only access the upper triangle, I would do `B = A.data` at the beginning of the function (for both the `UnitUpperTriangular` and `UpperTriangular` methods), and then use `B[i,j]` rather than `A[i,j]`.
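For illustration, a rough sketch of why the wrapper indexing costs something and of the suggested unwrapping; the `getindex` lines are paraphrased, not the exact Base definitions, and `upper_sum_sketch` is a made-up example function:

```julia
# Roughly what indexing the wrappers has to do (paraphrased):
#   A::UpperTriangular:     i <= j ? A.data[i,j] : zero(eltype(A))
#   A::UnitUpperTriangular: i < j  ? A.data[i,j] : (i == j ? one(eltype(A)) : zero(eltype(A)))

# Suggested pattern: unwrap once, then index the plain array inside the kernel.
function upper_sum_sketch(A::UpperTriangular)
    B = A.data                    # keep A for dispatch; B is the underlying Matrix
    s = zero(eltype(B))
    @inbounds for j = 1:size(B, 2), i = 1:j
        s += B[i, j]              # no per-element i <= j branch
    end
    return s
end
```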
(There might be other methods operating on triangular matrix types that might benefit from a similar change.)
The `@inbounds` means that it never checks whether `i` and `j` are in `1:n`. It still checks whether `i ≤ j`, since if `i > j` then it just returns zero rather than looking at `A.data`.

Timing such a small operation is tricky; I would normally recommend BenchmarkTools instead of `@time`:

```julia
julia> using BenchmarkTools

julia> A = rand(10,10); T = UpperTriangular(A);

julia> @btime $A[3,5]; @btime $T[3,5];
  1.377 ns (0 allocations: 0 bytes)
  1.650 ns (0 allocations: 0 bytes)
```
(It may be that the effect on the `UpperTriangular` case is too small to measure, however. Or maybe even LLVM is smart enough to figure out that `i ≤ j` is always true. But it wouldn't hurt to unwrap the type anyway.)
After improving the type stability, the specialised implementation for `UnitUpperTriangular` is faster than converting.

Now, on 0.5.0 using `@benchmark`, I see a slowdown when unwrapping the type first in `sqrtm`. Relatedly, on my machine (juliabox.com) there appears to be a slowdown for the atomic operations:

```
@benchmark $A[1,2]
  median time: 2.650 ns
@benchmark $T[1,2]
  median time: 2.223 ns
```

(What does the `$` do here?)

This persists when wrapping these calls in a function. All in all I am hesitant to make the unwrapping change for `sqrtm` (but not against).
How did you unwrap the type? I hope you didn't do `A = A.data`, which hurts type stability.

I can't reproduce your benchmark results. Maybe the benchmarks aren't reliable for such a short operation, and one needs to put a bunch of indexing operations in a loop? (Probably you also want to time them with `@inbounds`.) The `$` splices the value of the variable directly into the expression, which gets rid of the penalty for benchmarking with a global variable (it means the compiler does not have to do runtime type inference to find the type of `A`).
For example:

```julia
julia> function sumupper(A)
           s = zero(eltype(A))
           for j = 1:size(A,2), i = 1:j
               s += A[i,j]
           end
           return s
       end
sumupper (generic function with 1 method)

julia> A = rand(100,100); T = UpperTriangular(A);

julia> using BenchmarkTools

julia> sumupper(A) == sumupper(T)
true

julia> @btime sumupper($A); @btime sumupper($T);
  4.122 μs (0 allocations: 0 bytes)
  5.102 μs (0 allocations: 0 bytes)
```
As suggested by @stevengj

@stevengj great idea, implemented 👍 In my testing, I found that

@andreasnoack I'm preparing a PR for BaseBenchmarks 👍
```
@@ -1889,14 +1888,18 @@ function sqrtm{T}(A::UpperTriangular{T})
end
end
end
sqrtm(A::UpperTriangular,Val{realmatrix})
```
The `::UpperTriangular` type assertion isn't usually used on call sites.
```
@@ -1879,19 +1888,20 @@ function sqrtm{T}(A::UpperTriangular{T})
end
end
end
sqrtm(A,Val{realmatrix})
end
function sqrtm{T,realmatrix}(A::UpperTriangular{T},::Type{Val{realmatrix}})
if realmatrix
TT = typeof(sqrt(zero(T)))
else
TT = typeof(sqrt(complex(-one(T))))
```
Just use `sqrt(complex(zero(T)))` here, since at some point `one` may return a dimensionless value if `T` is dimensionful.
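A hedged illustration of the dimensionful point, using Unitful.jl (an external package, not part of this PR) only as an example of a unitful element type; the comments describe the expected results:

```julia
using Unitful  # illustrative external package, not used by Base

x = 4.0u"m"      # a dimensionful value of some type T
zero(x)          # 0.0 m  -- keeps the units of T
one(x)           # 1.0    -- dimensionless multiplicative identity
sqrt(zero(x))    # 0.0 m^(1/2)  -- the units sqrtm's elements should have
sqrt(one(x))     # 1.0          -- dimensionless, i.e. the wrong element type
```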
```
@@ -1867,7 +1867,16 @@ function logm{T<:Union{Float64,Complex{Float64}}}(A0::UpperTriangular{T})
end
logm(A::LowerTriangular) = logm(A.').'

function sqrtm{T}(A::UpperTriangular{T})
function floop(x,R,i::Int,j::Int)
r = x
```
This is not type stable if `A[i,j]` and `R[i,j]` are not the same type. A simple fix is `r = x + zero(eltype(R))`.
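A minimal sketch of the instability and of the suggested fix; the function names are illustrative:

```julia
# If x::Float64 but R::Matrix{Complex{Float64}}, r starts as a Float64 and
# becomes Complex{Float64} on the first loop iteration -- type-unstable.
function floop_unstable(x, R, i::Int, j::Int)
    r = x
    for k = i+1:j-1
        r -= R[i,k]*R[k,j]
    end
    return r
end

# Promoting r up front pins its type for the whole loop.
function floop_stable(x, R, i::Int, j::Int)
    r = x + zero(eltype(R))
    for k = i+1:j-1
        r -= R[i,k]*R[k,j]
    end
    return r
end
```

The difference shows up with `@code_warntype`, e.g. `@code_warntype floop_unstable(1.0, complex(rand(4,4)), 1, 4)`.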
@felixrehren, have you tried
```
for j = 1:n
R[j,j] = realmatrix ? sqrt(A[j,j]) : sqrt(complex(A[j,j]))
for i = j-1:-1:1
r = A[i,j] + zero(TT)
```
Even simpler: `r::TT = A[i,j]`.

Though (like a lot of our generic linear-algebra code) this isn't right for dimensionful quantities, because `r` has the units of `A` whereas `R` has the units of `sqrt(A)`. If we want to get that right, it should really be something like `TA = realmatrix ? float(T) : Complex{float(T)}`, and then use `r::TA`.
To be clear, do you suggest replacing `TT = realmatrix ? typeof(sqrt(zero(T))) : typeof(sqrt(complex(zero(T))))` with `TT = realmatrix ? float(T) : Complex{float(T)}`, or would `TA` be supplemental to `TT`? Because `r` should have the units of `R = sqrtm(A)`?
`TA` would be supplemental to `TT`, because `A` and `R` have different units.
Instead of introducing a new type variable, you can simply do `r = A[i,j] - zero(TT)*zero(TT)`, which matches the operation in the loop and is therefore correct. We should avoid explicit `float` calls if possible, since it requires that `float` is implemented correctly for new types (which might not be the case).
I hope this is addressed -- I guess we should write some tests? (for a different PR ...)
```
end
n = checksquare(A)
n = Base.LinAlg.checksquare(A)
R = zeros(TT, n, n)
```
~~My inclination would be to use `R = Matrix{TT}(n, n)` here. Then get rid of the `r == 0` check below. `sqrtm` is unlikely to return a sparse result, so I don't see the point of pre-initializing the array to zero.~~
Oh, nevermind, I forgot that this is for the upper-triangular case. Then I guess initializing it to zero makes sense.
@stevengj Thanks for the comments, I am also learning a lot! I think the AppVeyor failure is unrelated?
The "Partial linear indexing" thing was just fixed by #20242 |
Replace `TT` by `t` for the type of the sqrt of a variable of type `T`; introduce `tt` as the type of the square of a variable of type `t`. N.B. `tt` is not always the same as `T`: it could be `Complex{T}`. In the `UnitUpperTriangular` case, some of the complexity should fall away; TODO?
```
@simd for k = i+1:j-1
r -= R[i,k]*R[k,j]
end
r==0 || (R[i,j] = r/one)
```
`one` won't be equivalent to the former `(R[i,i] + R[j,j])`, will it?
You're right, thanks!
Why is it equivalent to `2*R[1,1]`?
Because this is for the `UnitUpperTriangular` case only, `R[i,i] = R[1,1]`.
Ah right. Missed the `UnitUpperTriangular` since that line was on the other side of a long conversation.
I know this is not a blocker for 0.6, but it would be nice to get it in as we plan for an alpha release. Marking it 0.6 mainly as a reminder, but feel free to take off the 0.6 tag if necessary.
Please don't put "nice to have" on the milestone.
Think this is it -- @stevengj, @andreasnoack?
```
@simd for k = i+1:j-1
r -= R[i,k]*R[k,j]
end
r==0 || (R[i,j] = half*r)
```
The old version of this function might not have worked correctly in this case either, but would this do the right thing if the element type were not commutative under multiplication?
See my comment above for the general case. However, even for non-commutative algebras the case here is special, because `R[i,i]` is the identity, which always commutes with everything. So the problem reduces to solving `R[i,j] + R[i,j] = r`. If the elements form a field over the reals, then the solution `R[i,j] = r/2` is correct.

The case here is also wrong if e.g. the elements are matrices, in which case `1/(2*R[1,1])` will throw an error, although if you do `half = inv(2*R[1,1])` it will work. However, it would be even better to just do `R[i,j] = r/2` in that case, since it would avoid the matrix multiplication.

However, it is still wrong if the elements are some other number field where you can't do `/2`. The correct, general solution seems like it should be

```julia
half = inv(R[1,1] + R[1,1])  # don't use 1/(2R[1,1]) or half=0.5, to handle general algebraic types
```
(Note that this PR can go in after the feature freeze since it is just a performance optimization.)
```
@simd for k = i+1:j-1
r -= R[i,k]*R[k,j]
end
r==0 || (R[i,j] = r / (R[i,i] + R[j,j]))
```
Per @tkelman's comment below, for non-commutative number types (e.g. quaternions), this is not right.

I haven't gone through the algebra very carefully, but I think the final `R[i,j]` needs to solve `R[i,i]*R[i,j] + R[i,j]*R[j,j] == r`, i.e. a Sylvester equation. So the right thing to do seems like

```julia
R[i,j] = sylvester(R[i,i], R[j,j], -r)
```

and then add a method

```julia
sylvester(a::Union{Real,Complex}, b::Union{Real,Complex}, c::Union{Real,Complex}) =
    (-c) / (a + b)
```

for the commutative `Number` cases. New `Number` types will then have to provide their own `sylvester` method if they want `sqrtm` to work. (We already have `sylvester` methods for matrices thanks to #7435.)
```
sqrtm(A,Val{realmatrix})
end
# solve the sylvester equation a*x + x*b + c for x when a,b,x are commutative numbers. PR#20214
sylvester(a::Union{Real,Complex},b::Union{Real,Complex},c::Union{Real,Complex}) = -c / (a + b)
```
Probably this `sylvester` definition should go into complex.jl or similar? And then you'll need to make sure that the LinAlg module imports `Base.sylvester` so that it extends the base definition.
I would parenthesize `(-c)`. That way, if you call `sylvester(a, b, -r)`, the compiler will have a good shot at noticing that the two negations cancel.
Never mind, it looks like LLVM is smart enough to eliminate the double negation either way:

```julia
julia> syl(a,b,c) = -c / (a + b)
syl (generic function with 1 method)

julia> foo(a,b,c) = syl(a,b,-c)
foo (generic function with 1 method)

julia> @code_llvm foo(1.0,2.0,3.0)

define double @julia_foo_65538(double, double, double) #0 !dbg !5 {
top:
  %3 = fadd double %0, %1
  %4 = fdiv double %2, %3
  ret double %4
}
```
LGTM except that the definition of

Also, you don't need to put
@stevengj Ok, will do. Should I keep any of the comments or clean it all up?
You should have a comment on code that does something in a non-obvious way, to make sure no one changes it without realizing, but your comments don't have to mention the PR.
moved from triangular.jl
Think all comments so far are addressed; let me know if it could use further improvements.
LGTM. Is the overall speedup still 15x?
With all the nice tweaks, I see 20x in this benchmarking notebook on Julia 0.5.